Although K-Nearest Neighbours (KNN) is often regarded as a very simple machine learning algorithm, its utility and power are undeniable. It is one of the core algorithms for supervised learning. Simply put, supervised learning is the process of creating a model that can predict the value of a target variable from input data, using knowledge from a dataset where the actual values of the target variable are known. KNN can be used effectively for both classification (the target variable takes a limited number of values) and regression (the target variable takes on a continuous range of values) tasks. The simplest explanation of KNN for classification is that an object is classified by a plurality vote of its k nearest neighbours. For regression, KNN generalizes so that an object is assigned the average of the values of its k nearest neighbours. However, this is just a basic overview, and there are several factors to consider when using KNN to develop an effective machine learning model.
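As a quick illustration of both flavours (a hedged sketch on toy one-feature data, not the dataset introduced below), scikit-learn exposes them directly:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy training data: one feature, five samples.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_class = np.array([0, 0, 0, 1, 1])            # classification target
y_reg = np.array([1.0, 2.0, 3.0, 10.0, 11.0])  # regression target

# Classification: plurality vote of the 3 nearest neighbours.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
print(clf.predict([[2.5]]))  # neighbours at 1, 2, 3 all have class 0 -> 0

# Regression: average of the 3 nearest target values.
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)
print(reg.predict([[2.5]]))  # mean of 1, 2, 3 -> 2.0
```

The query point 2.5 sits among the first three training points, so the vote and the average are both taken over them.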
Dataset description
Before diving into KNN, let's briefly review the dataset that we chose to demonstrate model training with real data. The dataset we selected is Air Quality and Pollution Assessment, derived from the World Health Organization and the World Bank Group. It contains several features, in other words, columns; let's go through each one of them and explain what it means.
Temperature(°C): Average temperature of the region
Humidity (%): Relative humidity recorded in the region
PM2.5 Concentration (µg/m³): Fine particulate matter level
PM10 Concentration (µg/m³): Coarse particulate matter level
NO2 Concentration (ppb): Nitrogen dioxide level
SO2 Concentration (ppb): Sulfur dioxide level
CO Concentration (ppm): Carbon monoxide level
Proximity to Industrial Areas (km): Distance to the nearest industrial zone
Population Density (people/km²): Number of people per square kilometer in the region
Then there is the so-called target variable, the variable that we are trying to predict. In our dataset it is called Air Quality, and it can take 4 possible values:
Good: Clean air with low pollution levels.
Moderate: Acceptable air quality but with some pollutants present.
Poor: Noticeable pollution that may cause health issues for sensitive groups.
Hazardous: Highly polluted air posing serious health risks to the population.
Importing all libraries that will be used
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patheffects as path_effects
import seaborn as sns
import plotly.express as px
import ipywidgets as widgets
from ipywidgets import interactive
from sklearn import metrics
from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler
Preprocessing data
As part of data preprocessing, we read the data from a CSV file and then preprocess it. Since there are no missing values, there is nothing to fill in. All features are already numerical, so no further conversion is needed. The target variable can take on 4 values with a natural order between them (it is an ordinal categorical datatype), so we convert it into an ordered categorical type. We split the data into 3 parts, train, validation and test, with the train set being 60% of the original dataset and the validation and test sets each being 20%.
Code
def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """Function for preprocessing data."""
    qual_category = pd.api.types.CategoricalDtype(
        categories=['Hazardous', 'Poor', 'Moderate', 'Good'], ordered=True)
    df['Air Quality'] = df['Air Quality'].astype(qual_category)
    return df
Code
def read_data(path: str = 'data/data.csv', y: str = 'Air Quality', **kwargs) -> tuple:
    """Function that reads data and splits it into Train, Validation and Test
    datasets; also separates the target values from the other values.
    ---
    Attributes:
        path: [str], path to csv data file
        y: [str], name of target value
        kwargs: options, use seed for random_seed
    ---
    Returns:
        tuple with Train, Validation, Test parameter sets and
        target values: Train, Test, Validation
    """
    df = pd.read_csv(path)
    display(df.info())
    display(df.describe())
    df = preprocess_data(df)
    # Split the dataset into train and rest (default 60% : 40%)
    Xtrain, Xrest, ytrain, yrest = train_test_split(
        df.drop(columns=[y]), df[y], test_size=0.4,
        random_state=kwargs.get('seed', 42))
    # Split the rest into validation and test datasets (default 20% : 20% of the original)
    Xtest, Xval, ytest, yval = train_test_split(
        Xrest, yrest, test_size=0.5, random_state=kwargs.get('seed', 42))
    print(f"Dataset: {path} | Target value: {y} | Seed: {kwargs.get('seed', 42)}")
    return Xtrain, Xtest, Xval, ytrain, ytest, yval
Dataset: data/data.csv | Target value: Air Quality | Seed: 42
Training data analysis
Before training the model on the train part of the dataset, we can look at the values to learn a little more about the data. We explore only the train part: we do not look at values from the validation and test datasets because we want to treat them as "new" data that the model has not seen, so that they can be used for accuracy estimates.
Code
df_original = Xtrain
df_tmp = ytrain.copy()
df_tmp = df_tmp.astype("category")
df_tmp_counts = df_tmp.value_counts()
custom_colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
fig = plt.figure(figsize=(15, 8))
ax1 = fig.add_subplot(1, 2, 1)
sns.barplot(x=df_tmp_counts.index, y=df_tmp_counts.values, ax=ax1, order=df_tmp_counts.index)
for i, bar in enumerate(ax1.patches):
    bar.set_facecolor(custom_colors[i])
ax1.set_xlabel("Air quality type", fontsize=13)
ax1.set_ylabel("Frequency", fontsize=13)
ax1.set_title("Air quality type Frequency in Train dataset", fontsize=17)
ax1.grid(axis='y', color='black', alpha=.2, linewidth=.5)
quality_count = df_tmp.value_counts()
ax2 = fig.add_subplot(1, 2, 2)
ax2.pie(quality_count, labels=quality_count.index, autopct='%.0f%%', textprops={"fontsize": 13})
ax2.set_title("Air quality type distribution in Train dataset", fontsize=17)
Text(0.5, 1.0, 'Air quality type distribution in Train dataset')
fig, axes = plt.subplots(2, 2, figsize=(15, 15))

# Temperature histogram
ax = axes[0, 0]
counts, bins, patches = ax.hist(df_original["Temperature"], edgecolor="black", color="skyblue")
ax.set_xticks(bins)
ax.set_xticklabels([round(x) for x in bins])
ax.set_title("Temperature histogram (Train)", fontsize=17)
ax.set_xlabel("Temperature (°C)", fontsize=13)
ax.set_ylabel("Frequency", fontsize=13)
for i, count in enumerate(counts):
    ax.text(bins[i] + 2.5, count + 10, int(count), ha="center", va="bottom")

# Humidity histogram
ax = axes[0, 1]
counts, bins, patches = ax.hist(df_original["Humidity"], edgecolor="black", color="skyblue")
ax.set_xticks(bins)
ax.set_xticklabels([int(x) for x in bins])
ax.set_title("Humidity histogram (Train)", fontsize=17)
ax.set_xlabel("Humidity (%)", fontsize=13)
ax.set_ylabel("Frequency", fontsize=13)
for i, count in enumerate(counts):
    ax.text(bins[i] + 4.5, count + 10, int(count), ha="center", va="bottom")

# PM2.5 histogram
ax = axes[1, 0]
counts, bins, patches = ax.hist(df_original["PM2.5"], edgecolor="black", color="skyblue")
ax.set_xticks(bins)
ax.set_xticklabels([int(x) for x in bins])
ax.set_title("PM2.5 histogram (Train)", fontsize=17)
ax.set_xlabel("PM2.5 (µg/m³)", fontsize=13)
ax.set_ylabel("Frequency", fontsize=13)
for i, count in enumerate(counts):
    ax.text(bins[i] + 15.5, count + 10, int(count), ha="center", va="bottom")

# PM10 histogram
ax = axes[1, 1]
counts, bins, patches = ax.hist(df_original["PM10"], edgecolor="black", color="skyblue")
ax.set_xticks(bins)
ax.set_xticklabels([int(x) for x in bins])
ax.set_title("PM10 histogram (Train)", fontsize=17)
ax.set_xlabel("PM10 (µg/m³)", fontsize=13)
ax.set_ylabel("Frequency", fontsize=13)
for i, count in enumerate(counts):
    ax.text(bins[i] + 15.5, count + 10, int(count), ha="center", va="bottom")
Code
fig, axes = plt.subplots(2, 2, figsize=(15, 15))

# NO2 histogram
ax = axes[0, 0]
counts, bins, patches = ax.hist(df_original["NO2"], edgecolor="black", color="skyblue")
ax.set_xticks(bins)
ax.set_xticklabels([int(x) for x in bins])
ax.set_title("NO2 histogram (Train)", fontsize=17)
ax.set_xlabel("NO2 (ppb)", fontsize=13)
ax.set_ylabel("Frequency", fontsize=13)
for i, count in enumerate(counts):
    ax.text(bins[i] + 2.5, count + 10, int(count), ha="center", va="bottom")

# SO2 histogram
ax = axes[0, 1]
counts, bins, patches = ax.hist(df_original["SO2"], edgecolor="black", color="skyblue")
ax.set_xticks(bins)
ax.set_xticklabels([int(x) for x in bins])
ax.set_title("SO2 histogram (Train)", fontsize=17)
ax.set_xlabel("SO2 (ppb)", fontsize=13)
ax.set_ylabel("Frequency", fontsize=13)
for i, count in enumerate(counts):
    ax.text(bins[i] + 2.5, count + 10, int(count), ha="center", va="bottom")

# CO histogram
ax = axes[1, 0]
counts, bins, patches = ax.hist(df_original["CO"], edgecolor="black", color="skyblue")
ax.set_xticks(bins)
ax.set_xticklabels([f'{x:.1f}' for x in bins])
ax.set_title("CO histogram (Train)", fontsize=17)
ax.set_xlabel("CO (ppm)", fontsize=13)
ax.set_ylabel("Frequency", fontsize=13)
for i, count in enumerate(counts):
    ax.text(bins[i] + 0.15, count + 10, int(count), ha="center", va="bottom")

# Proximity to Industrial Areas histogram
ax = axes[1, 1]
counts, bins, patches = ax.hist(df_original["Proximity_to_Industrial_Areas"], edgecolor="black", color="skyblue")
ax.set_xticks(bins)
ax.set_xticklabels([int(x) for x in bins])
ax.set_title("Proximity to Industrial Areas histogram (Train)", fontsize=17)
ax.set_xlabel("Proximity to Industrial Areas (km)", fontsize=13)
ax.set_ylabel("Frequency", fontsize=13)
for i, count in enumerate(counts):
    ax.text(bins[i] + (bins[i + 1] - bins[i]) / 2, count + 10, int(count), ha="center", va="bottom")
Code
fig, axes = plt.subplots(2, 2, figsize=(13, 13))

# Population Density histogram
ax = axes[0, 0]
counts, bins, patches = ax.hist(df_original["Population_Density"], edgecolor="black", color="skyblue")
ax.set_xticks(bins)
ax.set_xticklabels([int(x) for x in bins])
ax.set_title("Population Density histogram (Train)", fontsize=17)
ax.set_xlabel("Population Density (people/km²)", fontsize=13)
ax.set_ylabel("Frequency", fontsize=13)
for i, count in enumerate(counts):
    ax.text(bins[i] + 28.5, count + 5, int(count), ha="center", va="bottom")

# Hide the three unused panels
axes[0, 1].set_visible(False)
axes[1, 0].set_visible(False)
axes[1, 1].set_visible(False)
Training the model
We are trying to solve a supervised learning problem: based on p features \(X_{1}, ..., X_{p}\) (Temperature, Humidity, …) we want to predict the value of the target variable \(Y\) (Air Quality). We can collect these features into a vector \(\textbf{X} = (X_{1}, ..., X_{p})^T\), which we interpret as a random vector; one of its specific realizations will be denoted \(\mathbf{x}\). Let \(\mathcal{X}\), where \(\textbf{x} \in \mathcal{X}\), be the set containing all possible values of these features; typically, and in our case, \(\mathcal{X} = \mathbb{R}^p\). Then we can describe our training dataset as N pairs \((\textbf{x}_{1}, Y_{1}), ..., (\textbf{x}_{N}, Y_{N})\), where \(\textbf{x}_{i}\) is a vector of features and \(Y_{i}\) is the target variable. The basic concept of predicting the value for a data point \(\mathbf{x} \in \mathcal{X}\) is to find the k closest points to it (these points come from the train dataset). For regression problems, we then take the average of the target variables of the k closest neighbours. For classification problems (our case) we take the most frequent value among the target variables of the k closest neighbours.
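To make the classification rule concrete, here is a minimal from-scratch sketch (a hypothetical helper using Euclidean distance and a plain majority vote, not the implementation used in this text):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote of its k nearest training points (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy two-feature training set with two classes.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array(['Good', 'Good', 'Good', 'Poor', 'Poor'])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> Good
```

The query point (0.5, 0.5) lies next to the three 'Good' points, so all k = 3 neighbours vote 'Good'.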
When trying to find the best model, there are many things to consider. First of all, we should define a distance function. A distance, also called a metric, on the set \(\mathcal{X}\) is a function \(d: \mathcal{X} \times \mathcal{X} \rightarrow [0, +\infty)\) such that for all \(x, y, z \in \mathcal{X}\) the following conditions hold:
\(\text{i) } d(x,y) \geq 0\), and \(d(x,y) = 0\) if and only if \(x = y\) (positive definiteness) \(\text{ii) } d(x,y) = d(y,x)\) (symmetry) \(\text{iii) } d(x,y) \leq d(x,z) + d(z,y)\) (triangle inequality)
The most common distance function used in KNN is the Minkowski distance, also called the q-norm, defined as follows: \(||x-y||_{q} = d_{q}(x,y) = \sqrt[q]{\sum_{i=1}^{p} |x_i - y_i|^q}\)
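Note the absolute value \(|x_i - y_i|\), which matters for q = 1 and other odd q; in scikit-learn this q is the `p` parameter of `KNeighborsClassifier`. A small sketch of the formula:

```python
import numpy as np

def minkowski(x, y, q):
    """q-norm (Minkowski) distance between two feature vectors."""
    return np.sum(np.abs(x - y) ** q) ** (1 / q)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(minkowski(x, y, 1))  # q = 1, Manhattan distance -> 7.0
print(minkowski(x, y, 2))  # q = 2, Euclidean distance -> 5.0
```

For q = 1 the terms simply add up (3 + 4 = 7), while for q = 2 we recover the familiar Euclidean length of the 3-4-5 triangle.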
After we have found the k nearest neighbours, we need to decide what value to predict. In our case, we could simply predict the most frequent target variable among them. Alternatively, we can try weighted voting, meaning that closer neighbours cast votes with higher weight.
A common choice of weight is inverse distance weighting, \(w_{i} = \frac{1}{d(\mathbf{x}, \mathbf{x}_i)}\)
For each category \(c\) that the target variable can take, we count the total weighted vote \(W_{c} = \sum_{Y_i \in \mathcal{N}} w_i \cdot \mathbb{1}(Y_i = c)\)
where \(\mathcal{N}\) is the set of target variables of the k nearest neighbours and \(\mathbb{1}\) is the indicator function. We then choose \(\hat{y} = \arg\max_c W_c\), with \(c\) ranging over all possible categories of the target variable, as our prediction.
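A small sketch of this weighted vote (a hypothetical helper, not the notebook's code; scikit-learn does the same internally with `weights='distance'`):

```python
from collections import defaultdict

def weighted_vote(distances, labels):
    """Inverse-distance weighted vote among the k nearest neighbours."""
    totals = defaultdict(float)
    for d, label in zip(distances, labels):
        totals[label] += 1.0 / d  # w_i = 1 / d(x, x_i)
    return max(totals, key=totals.get)

# One very close 'Good' neighbour vs two distant 'Poor' neighbours:
print(weighted_vote([0.1, 1.0, 1.0], ['Good', 'Poor', 'Poor']))  # -> Good
```

Here 'Good' wins with weight 10 against 'Poor' with weight 2, even though 'Poor' has the plurality of votes, which is exactly the point of distance weighting.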
Code
def train_model(*args, **kwargs):
    """Function that trains a model with specific parameters defined in kwargs.
    ---
    Attributes:
        *args: Xtrain, ytrain
        **kwargs: options for training model
    ---
    Returns:
        trained model
    """
    X = args[0]
    y = args[1]
    return KNeighborsClassifier(**kwargs).fit(X, y)
Finding the Best model (Hyperparameter tuning)
When finding the best model, there are many things to consider. Our goal is to maximize classification accuracy on new data that the model has never seen before. Accuracy is defined as \(\frac{\text{number of correctly classified}}{\text{number of all classified}}\). Specifically, we maximize the accuracy with which a model built on the train dataset predicts on the validation dataset. To do this, we can vary the number of neighbours (k), the distance metric, and whether distance weighting makes any difference. We build a model for every combination of these possibilities, measure its accuracy on the validation dataset, and choose the model with the highest accuracy.
Code
def find_best(*args):
    """Function finding the best parameters for the specified data.
    ---
    Attributes: Xtrain, ytrain, Xval, yval
    """
    param_grid = {'n_neighbors': range(1, 15),
                  'p': range(1, 4),
                  'weights': ['uniform', 'distance']}
    param_comb = ParameterGrid(param_grid)
    val_acc = []
    train_acc = []
    for params in param_comb:
        clfKNN = KNeighborsClassifier(**params)
        clfKNN.fit(args[0], args[1])
        train_acc.append(clfKNN.score(args[0], args[1]))
        val_acc.append(clfKNN.score(args[2], args[3]))
    return param_comb[np.argmax(val_acc)], train_acc, val_acc
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_title("Model accuracy on Training and Validating datasets based on hyperparameter index (no normalization)", fontsize=17)
ax.plot(train_acc, 'or-')
ax.plot(val_acc, 'ob-')
max_idx = np.argmax(val_acc)
ax.plot(max_idx, val_acc[max_idx], marker='o', color='#77DD77', markersize=13,
        markeredgecolor='black', markeredgewidth=2)
ax.set_ylim(bottom=0.67, top=1.02)
ax.set_xlabel('Hyperparameter index', fontsize=13)
ax.set_ylabel('Accuracy', fontsize=13)
ax.legend(['Train', 'Validate', 'Best validate accuracy'], loc='lower right')
Normalization
One technique that can also help make the model more accurate is normalization. In some cases the features live on very different scales, are difficult to compare, and no universal measure exists for them. This can be addressed by normalizing the data with a linear transformation. The simplest method is Min-Max normalization, which scales every element of the selected feature into the interval \([0,1]\). It is defined as follows:
For a given feature, call its minimal value \(\text{min}_x\) and its maximal value \(\text{max}_x\); then we redefine each value \(x_i\) of this feature as \(x_i \leftarrow \frac{x_i - \text{min}_x}{\text{max}_x - \text{min}_x}\)
Another commonly used normalization method is standardization, defined as \(x_i \leftarrow \frac{x_i - \bar{x}}{\sqrt{s_x^2}}\), where \(\bar{x} = \frac{1}{n} \sum_{i} x_i\) is the sample mean and \(s_x^2 = \frac{1}{n-1} \sum_{i} (x_i - \bar{x})^2\) is the sample variance.
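A small sketch of both transformations on a toy column (one caveat: scikit-learn's `StandardScaler` divides by the population standard deviation, i.e. it uses n rather than n - 1 in the variance, so its output differs slightly from the sample-variance formula above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[2.0], [4.0], [6.0], [8.0]])

# Min-Max: (x - min) / (max - min), maps 2, 4, 6, 8 onto [0, 1/3, 2/3, 1].
print(MinMaxScaler().fit_transform(x).ravel())

# Standardization: (x - mean) / std, with mean 5 and population std sqrt(5).
print(StandardScaler().fit_transform(x).ravel())
```

Both scalers are fitted on the training data only and then applied to validation and test data with `transform`, so no information leaks from the held-out sets.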
Standardization accuracy
Using Standardization, the best accuracy we get is 0.931
import plotly.graph_objects as go
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

Xtrain, Xtest, Xval, ytrain, ytest, yval = read_data(seed=42)

# ----------------------------------------------------
# Dummy data and simulation of model prediction
# (Replace these with your actual data and train_model function)
# ----------------------------------------------------
# For demonstration, create a dummy validation set
np.random.seed(0)
N_val = 50
Xval = pd.DataFrame({
    'Proximity_to_Industrial_Areas': np.sort(np.random.rand(N_val) * 10),
    'CO': np.random.rand(N_val) * 0.5 + 0.2})

# The dummy validation set only has these two features, so the model must be
# fitted on the same two columns of Xtrain; fitting on all columns and then
# predicting on the two-column Xval raises a feature-name mismatch error.
feature_cols = ['Proximity_to_Industrial_Areas', 'CO']
Xtrain_2d = Xtrain[feature_cols]

# Define color mapping and category order
color_map = {'Good': 'green', 'Moderate': 'yellow', 'Poor': 'orange', 'Hazardous': 'red'}
category_order = ["Good", "Moderate", "Poor", "Hazardous"]

def simulate_prediction(n_neighbors, p, scaler_option, weight):
    """Train a KNN model on the two demo features and predict on the dummy Xval."""
    if scaler_option == 'Standard':
        scaler = StandardScaler()
        Xtrain_scaled = scaler.fit_transform(Xtrain_2d)  # fit on training data
        Xval_scaled = scaler.transform(Xval)             # only transform validation data
    elif scaler_option == 'MinMax':
        scaler = MinMaxScaler()
        Xtrain_scaled = scaler.fit_transform(Xtrain_2d)  # fit on training data
        Xval_scaled = scaler.transform(Xval)             # only transform validation data
    else:
        Xtrain_scaled = Xtrain_2d.copy()
        Xval_scaled = Xval.copy()                        # no scaling
    knn_model = KNeighborsClassifier(n_neighbors=n_neighbors, p=p, weights=weight)
    knn_model.fit(Xtrain_scaled, ytrain)
    return knn_model.predict(Xval_scaled)

# ----------------------------------------------------
# Precompute frames for each parameter combination
# ----------------------------------------------------
n_neighbors_list = [1, 6, 11, 16, 20]         # 5 values (instead of 1-20)
p_list = [1, 2]                               # 2 values
scaler_list = ['None', 'Standard', 'MinMax']  # 3 options
weights_list = ['uniform', 'distance']        # 2 options

frames = []  # list holding all precomputed frames
for n in n_neighbors_list:
    for p_val in p_list:
        for scaler_option in scaler_list:
            for weight in weights_list:
                pred = simulate_prediction(n, p_val, scaler_option, weight)
                # Copy Xval and add predictions as a new column
                df_viz = Xval.copy()
                df_viz['Air Quality'] = pred
                # One scatter trace per category, mimicking px.scatter output
                traces = []
                for cat in category_order:
                    subset = df_viz[df_viz['Air Quality'] == cat]
                    traces.append(go.Scatter(
                        x=subset['Proximity_to_Industrial_Areas'],
                        y=subset['CO'],
                        mode='markers',
                        marker=dict(color=color_map[cat], size=10),
                        name=cat,
                        showlegend=True))
                # Frame name encodes the parameter values
                frame_name = f"n:{n} | p:{p_val} | scaler:{scaler_option} | weight:{weight}"
                frames.append(go.Frame(data=traces, name=frame_name))

# ----------------------------------------------------
# Build the figure using the default parameters
# ----------------------------------------------------
default_frame = frames[0]
fig = go.Figure(data=default_frame.data, frames=frames)
fig.update_layout(
    title=dict(text="Interactive KNN Visualization", font=dict(size=27)),
    xaxis=dict(title="Proximity to Industrial Areas (km)", titlefont=dict(size=19)),
    yaxis=dict(title="CO (ppm)", titlefont=dict(size=19)),
    height=550)

# ----------------------------------------------------
# Slider for number of neighbors
# ----------------------------------------------------
n_neighbors_steps = []
for n in n_neighbors_list:
    step = dict(
        method="animate",
        args=[
            [f"n:{n} | p:{p_val} | scaler:{scaler_option} | weight:{weight}"
             for p_val in p_list for scaler_option in scaler_list for weight in weights_list],
            {"mode": "immediate", "frame": {"duration": 300, "redraw": True},
             "transition": {"duration": 300}}],
        label=f"{n} neighbors")  # the label shows the number of neighbors
    n_neighbors_steps.append(step)
n_neighbors_slider = dict(
    active=0, currentvalue={"prefix": "Number of Neighbors: "},
    pad={"t": 50}, steps=n_neighbors_steps)

# ----------------------------------------------------
# Slider for p metric (distance metric)
# ----------------------------------------------------
p_steps = []
for p_val in p_list:
    step = dict(
        method="animate",
        args=[
            [f"n:{n} | p:{p_val} | scaler:{scaler_option} | weight:{weight}"
             for n in n_neighbors_list for scaler_option in scaler_list for weight in weights_list],
            {"mode": "immediate", "frame": {"duration": 300, "redraw": True},
             "transition": {"duration": 300}}],
        label=f"p = {p_val}")  # the label shows the p metric value
    p_steps.append(step)
p_slider = dict(
    active=0, currentvalue={"prefix": "Distance Metric (p): "},
    pad={"t": 50}, steps=p_steps)

# ----------------------------------------------------
# Buttons for normalization (scaler) and weight
# ----------------------------------------------------
scaler_buttons = []
for scaler_option in scaler_list:
    scaler_buttons.append(dict(
        label=scaler_option,
        method="animate",
        args=[
            [f"n:{n} | p:{p_val} | scaler:{scaler_option} | weight:{weight}"
             for n in n_neighbors_list for p_val in p_list for weight in weights_list],
            {"mode": "immediate", "frame": {"duration": 300, "redraw": True},
             "transition": {"duration": 300}}]))

weight_buttons = []
for weight in weights_list:
    weight_buttons.append(dict(
        label=weight,
        method="animate",
        args=[
            [f"n:{n} | p:{p_val} | scaler:{scaler_option} | weight:{weight}"
             for n in n_neighbors_list for p_val in p_list for scaler_option in scaler_list],
            {"mode": "immediate", "frame": {"duration": 300, "redraw": True},
             "transition": {"duration": 300}}]))

# Create layout for the buttons
scaler_buttons_layout = dict(x=0.1, y=-0.25, showactive=True, buttons=scaler_buttons)
weight_buttons_layout = dict(x=0.1, y=-0.35, showactive=True, buttons=weight_buttons)

# ----------------------------------------------------
# Update the layout with the sliders and buttons
# ----------------------------------------------------
fig.update_layout(
    sliders=[n_neighbors_slider, p_slider],
    updatemenus=[scaler_buttons_layout, weight_buttons_layout])

# ----------------------------------------------------
# Render the interactive Plotly figure
# ----------------------------------------------------
fig.show()
Dataset: data/data.csv | Target value: Air Quality | Seed: 42
Choosing the final model
Based on the results, we can see that the best model uses Min-Max normalization, weighted distance, the 3-norm and 3 neighbours, with an accuracy of 0.942. We can plot a confusion matrix, which shows how accurately we predicted each category. Finally, we estimate the expected accuracy on the test dataset, which tells us how accurately the model should perform on new, unseen data.
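The confusion-matrix step can be sketched on synthetic stand-in data (the toy data and model below are illustrative assumptions, not the notebook's actual fitted model; in the notebook one would call `confusion_matrix` on the real `ytest` and the best model's predictions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-ins for the fitted model and the held-out test split.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 'Good', 'Poor')
model = KNeighborsClassifier(n_neighbors=3).fit(X[:150], y[:150])

ypred = model.predict(X[150:])
cm = confusion_matrix(y[150:], ypred, labels=model.classes_)
print(cm)                     # rows: true class, columns: predicted class
print(cm.trace() / cm.sum())  # accuracy recovered from the matrix diagonal
```

The diagonal holds the correctly classified counts per category, so dividing its trace by the total count reproduces the overall test accuracy.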
Code
best_params_minmax
Code
max_idx = np.argmax(val_acc)
max_idx_standard = np.argmax(val_acc_standard)
max_idx_minmax = np.argmax(val_acc_minmax)
labels = ['without_normalization', 'standardization', 'Min-Max']
values = [val_acc[max_idx], val_acc_standard[max_idx_standard], val_acc_minmax[max_idx_minmax]]
fig, ax = plt.subplots()
colors = ['#3498db', '#2ecc71', '#e74c3c']
ax.bar(labels, values, color=colors, edgecolor='black', width=0.5)
ax.set_ylim(0, 1.15)
ax.set_yticks([0, 0.2, 0.4, 0.6, 0.8, 1.0])
ax.set_xlabel("Normalization type", fontsize=14)
ax.set_ylabel("Accuracy", fontsize=14)
ax.set_title("Accuracy of models based on normalization", fontsize=16)
ax.grid(axis='y', linestyle='--', alpha=0.7)
for i, value in enumerate(values):
    ax.text(i, value + 0.02, f'{value:.2f}', ha='center', fontsize=12)
plt.show()